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Abstract. Partial wave analysis is a core tool in hadron spectroscopy. With the high 
statistics data available at facilities such as the Beijing Spectrometer III, this procedure becomes 
computationally very expensive. We have successfully implemented a framework for performing 
partial wave analysis on graphics processors. We discuss the implementation, the parallel 
computing frameworks employed and the performance achieved, with a focus on the recent 
transition to the OpenCL framework. 



1. Introduction 

The Beijing Electron-Positron Collider (BEPC) II is currently the world's highest luminosity 
machine in the r-charm energy region. The Beijing Spectrometer (BES) III experiment was 
designed to make the best possible use of this luminosity [1]; one of the core parts of the 
physics program [2] is a thorough study of the light hadron spectrum. For many final states, 
a wealth of, often broad, intermediate resonances will contribute to the amplitude. In order to 
disentangle interference effects and determine the spin and parity of the resonances, a partial 
wave analysis (PWA) is performed. As this involves the computation of the amplitude for every 
event in every iteration of a fit, this becomes computationally very expensive for large data 
samples. As events are independent and the amplitude calculation does not vary from event to 
event, this task is trivially parallelizable. This and the floating point intensity predestine PWA 
for implementation on graphics processing units (GPUs). We have successfully implemented a 
framework performing tensor calculations and partial wave flts on GPUs. Both the software 
architecture which completely encapsulates all the complexities of GPU based computing and 
the raw power of today's GPUs lead to significantly shortened analysis turnaround times. 



2. Partial wave analysis 

In a typical PWA (we use the radiative decay J/^ — > lX,X — K^K^ [3] as an example), 
the decay is modelled by coherently summing the contributions from a variety of intermediate 
resonances X. The relative magnitudes and phases of these resonances are determined from a 
fit and the fit result is compared to the data. The set of resonances and their properties are 
changed until a sufficient agreement with the data is found. 



The intensity / (relative number of events) at a particular point in phase space can be 
expressed as 
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where the sum runs over all intermediate resonances a, Va is the complex production amplitude 
for a and Aa{0,) the complex decay amplitude at a particular point in phase space. The 
likelihood for a particular model is 

where the product runs over all N^ata events in the sample and the integral is proportional to the 
total cross section, corrected for the detector efficiency e{Q). The integral is usually performed 
numerically by summing over a large number N^'q of simulated events (Monte Carlo, MC) 
generated evenly in phase space. The limited acceptance and efficiency of the detector can be 
taken into account by summing only over the iV^*^ simulated events that pass the reconstruction 
and analysis cuts. In an iterative fit, a minimum of the negative logarithm of the likelihood, 
corresponding to the best set of parameters for the used model is searched for; 
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The first sum runs over all data events, the second over all MC events. If the widths and masses 
of resonances are kept constant in the fit (i.e. the V^'s are the only free parameters), the last 
(inner) bracket and the ^Q,(r2j)yl*,(i7j) term for each data event can be pre-calculated. 

The number of floating point operations required is dominated by the sum over the data 
events and scales with N iterations x ^data X ^waves^ whilst the lookup table takes up storage 
space scaling with N^ata x ^ waves- The storage space problem can be addressed by increasing 
the memory of the relevant machine (~1.5 GB arc required per million events for a model with 
20 partial waves) , or with appropriate caching mechanisms for data samples with several million 
events. If the required floating point operations are performed sequentially, the time required 
can however become very long. There is no way of knowing whether the fit found just a local 
or the searched for global maximum of the likelihood. To gain confidence in the result, the fits 
are usually repeated with various sets of starting parameters. In addition, various models have 
to be tried out, especially in the study of possible new resonances and systematic effects. The 
thousands of fits needed in a typical partial wave analysis should thus be as fast as possible, 
especially as feedback from the physicist is required between the fits and they thus cannot easily 
be ran in parallel. 



3. GPU computing 

The advent of computer games with realistic three-dimensional depictions of a virtual world 
computed in real time has pushed the development of ever more powerful graphics processing 
units (CPUs). CPUs provide a large number of floating point units optimized for executing the 
same code on many sets of input data {Single Instruction Multiple Data SIMD). The theoretical 
performance of current CPUs surpasses 2 TFlop/s, at commodity prices of a few hundred dollars. 

Whilst early efforts for using graphics processors for general purpose computing relied on 
graphics interfaces such as OpenGL [4], several frameworks for easy access to GPU computing 
have become available in the meantime. Besides the vendor supplied frameworks {CUD A 



from Nvidia [5] and CAL/Brook+ from ATI [6]) there is now a vendor independent standard, 
OpenCL [7, 8]. Most of these frameworks provide lots of low level control on the granularity of the 
parallelism, memory management and so on. For typical high energy physics applications (and in 
particular PWA), this fine-grained control is not really needed, as events are fully independent. 
For these CclS6Sj cl framework with a high level of abstraction such as Brook+ is very convenient . 
As Brook+ is however not developed any further because AMD/ ATI concentrates on OpenCL, 
we built an abstraction layer on top of OpenCL. 

4. The GPUPWA framework 

We have developed a framework for GPU assisted partial wave analysis called GPUPWA. It 
supports amplitude creation and calculation in the covariant tensor formalism on the GPU as 
well as GPU based likelihood calculations. Sophisticated caching mechanisms ensure that the 
memory footprint on the GPU stays within the rather tight limits (especially if using commodity 
hardware). The ROOT framework [9] is used for data input and output, histogramming and, 
via the Minuit2 interface [10], for the minimisation process^. 

The framework completely abstracts all GPU interna from the user, who can program in pure 
C++, allowing for the creation of a PWA within hours (few days in cases with many complicated 
amplitudes) . More details on the interface and implementation (and also the striking similarities 
between PWA and game graphics calculations) can be found in [12], which describes the Brook+ 
implementation, which is very close to the current OpenCL version. 

Most of the calculations on the GPU are performed with single precision floating point 
numbers. The final sum however is always performed in double precision. For a large 
enough number of events the final precision is comparable to a fully double precision CPU 
implementation or even better, as the parallel tree summing employed on the GPU has less 
rounding issues than a traditional accumulator sum. 

5. Performance 

All the performance measurements quoted here were performed on an Intel Core 2 Quad 2.4 GHz 
workstation with 2 GB RAM running under Scientific Linux 5.2. The GPU is a ATI Radeon 4870 
with 512 MB RAM^. We compared a FORTRAN implementation used in BES II (without 
multi-threading and vectorization and fully in double precision) and the BROOK+ GPUPWA 
implementation of a J/ijj — > ^K'^K~ partial wave analysis, both minimizing using Fumili with 
analytical gradients and Hessian. The time for one fit iteration*^ was measured with the system 
timer for various sizes of the data samples (see figure 1). During the fit, data transfers over the 
PCIe bus are restricted to the fit parameters (to the GPU) and likelihoods, gradients and Hessian 
matrix (to the CPU); bus bandwidth is thus not the limiting factor. For large enough samples, 
the BROOK+ GPUPWA code is more than 150 times faster than the FORTRAN code. The 
same analysis was then ported to OpenCL in the AMD/ ATI implementation. Without further 
optimisation of the code, another 35% speed-up was obtained for large event numbers (see figure 
2); in part this is due to fewer restrictions in the size of memory allocations in OpenCL compared 
to Brook+, the rest is probably due to a more efficient compiler. 

^ We initially based our project on Brook+ and ATI hardware because at the time, only ATI allowed double 
precision calculations on consumer cards. This had the disadvantage that we were working with a rather 
experimental product with a small user base compared to Nvidias CUDA. Brook+ however offers a higher level 
of abstraction, a narrow interface to the GPU and more elegant programming. 

For all cases where we can provide analytic gradients, wc rely on the FUMILI fitter [11], as it converges with 
the fewest iterations and we do not require reliable parameter errors from the fitter in most cases. 
^ In the meantime, much more powerful hardware is available both at the CPU and the GPU end; the machine has 
however been available during the complete GPUPWA development process and thus remains an useful reference. 

Typical Fumili fits require 5-10 iterations, Minuit fits rather 50-100 (come however with useful error estimates). 
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Figure 1. Comparison of the running time of one fit iteration of the benchmark J/ip — > 
jK^K~ analysis in different implementations on the same machine. Both OpenCL datasets 
were obtained using the AMD implementation. The hardware used for the tests is described in 
section 5. 

Table 1. Time taken by an example analysis in various steps of the calculation in the OpenCL 
implementation on the GPU. For details of the set-up, see section 5. Note that the fit step 
includes initialization and cleanup and times are thus not directly comparable with those shown 
in figures 1 and 2. 



Step 


Time [s] for 
50'000 Events 


Time [s] for 
500'000 Events 


Start-up (initialization and reading of files) 


0.75 


3.67 


MC Integral 


1.01 


4.65 


Lookup table creation 


1.64 


4.92 


Fit (5 iterations) 


0.10 


0.99 


Plots 


1.48 


9.82 


Total 


4.98 


24.05 



The AMD/ATI OpenCL implementation also allows for running the code purely on the CPU, 
making use of multiple cores and vector units. On our test machine, this leads to a speed-up 
of more than one order of magnitude with regards to the FORTRAN implementation, but also 
still another order of magnitude removed from the GPU performance. 

A rough estimate on the number of floating point operations (both in single and double 
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Figure 2. Comparison of the running time of one fit iteration of the benchmark J/ijj — > 
jK^K^ analysis using the Brook+ and the OpenCL framework (in the ATI implementation) on 
the same machine/GPU. Both OpenCL datasets were obtained using the AMD implementation. 
The hardware used for the tests is described in section 5. 

precision) performed indicates that we reach on the order of 10 GFlop/s on the GPU, about 
1 % of the theoretical floating point performance of the card; most of the time is thus spent in 
memory accesses, communication over the PCI bus and the M/A'^f/JT/Fumili step on the CPU. 
As however the run time of a typical fit is dominated by reading input data and producing plots 
(see table 1 for example total timings), we do not put a high priority on further optimization of 
the fit proper. 

6. Open issues 

OpenCL finally offers a vendor independent access to these resources. OpenCL leaves however 
many details of the implementation to the vendors, possibly leading to portability issues; our 
attempt to port the OpenCL version of GPUPWA to an Apple computer running the Snow 
Leopard operating system (which natively incorporates OpenCL), was hampered by the different 
implementation of vector data types in the CPU part of the code between Apple and AMD 
(GPUPWA uses the OpenCL vector data types also in the C++ part of the code, e.g. for 
initialisation of data structures and event-independent tensor manipulations). Obviously it 
would be desirable to overcome these portability issues in the future. It would also be helpful 
for many high energy physics applications if there would be a higher level interface to the GPU 
than OpenCL, something like Brook for OpenCL. 



Most of the technical issues in PWA^ are however related to the fitting; in typical fits there 
are 20 to 100 free parameters, with all the issues that brings with it. Most of the parameters 
represent complex numbers, which are not handled well by the usual fitters; if a Cartesian 
representation is used, large correlations between the components are created if the phase is 
not well constrained, hampering fit convergence. A polar representation relies on a bounded 
and a periodical parameter, the first of which problematic for the fitter if the bound is a valid 
value, the second not handled by the fitters at all and often requiring manual intervention. A 
minimizing algorithm for complex numbers (including the proper handling of the derivatives) 
would be extremely useful here. 

7. Conclusion and outlook 

Using massively parallel computing, especially in the for of CPUs, is a powerful tool to 
overcome speed limits in partial wave analyses. The GPUPWA framework developed at the 
BES III experiment offers easy access to these resources with a pure C++ interface and all 
core functionalities provided by the framework. GPUPWA has been ported to use the OpenCL 
standard to interface parallel computing devices. The speed gain compared to a single-threaded 
non-vectorized version is more than an order of magnitude if using multiple threads and vector 
units on the CPU and more than two orders of magnitude when employing the GPU. Future 
development will focus on adding physics functionality to GPUPWA and to make the application 
truly portable. This will then also allow for hardware comparisons between the major GPU 
vendors. 

Acknowledgments 

I would like to thank my colleagues at IHEP who are testing and using GPUPWA for their 
many helpful suggestions, bug reports and additions to the framework. 

The work presented was supported by the Chinese Academy of Sciences and the Swiss 
National Science Foundation. 

References 

[1] Ablikim M et al. (BESIII) 2010 Nud. Instr. Meth. A614 345-399 {Preprint 0911.4960) 
[2] Asncr D M ei al. 2008 Physics at BES-III {Preprint arXiv: 0809. 1869 [hep-ex]) 
[3] Bai 3 Z et al. (BES) 2003 Phys. Rev. D 68 052003 {Preprint hep-ex/0307058) 

[4] Segal M and Akeley K 2008 The OpenGL graphics system: A specification (version 3.0) Tech. rep. The 
Khronos Group URL http://www.opengl.org/registry/doc/glspec30.20080811.pdf 

[5] Halfhill T R 2008 Microprocessor RepoH 01/28/08 1-6 

[6] Advanced Micro Devices 2007 AMD stream computing, software stack AMD white paper URL 

http : //at i . amd . com/ technology/ streamcomputing/f irestr eam-sdk-whitepaper . pdf 
[7] Munshi A 2008 The OpenCL specification, version 1.0 Tech. rep. The Khronos Group URL 

http : //www. khronos . org/registry/ cl/ specs/ opencl-1 . . 33 .pdf 
[8] Munshi A 2010 The OpenCL specification, version 1.1 Tech. rep. The Khronos Group URL 

http : //www. khronos . org/registry/ cl/specs/ opencl-1 . 1 .pdf 
[9] Brun R and Rademakers F 1997 Nucl. Instrum. Meth. A 389 81-86 
[10] James F and Roos M 1975 Comput. Phys. Commun. 10 343-367 
[11] FUMILI Tech. rep. CERN Prorgram Library, D510 
[12] Berger N, Liu B and Wang J 2010 J. Phys. Conf. Ser. 219 042031 

[13] 2009 The Jefferson Laboratory Upgrade to 12 GeV: Dedicated Workshop on Hadron Spectroscopy Seattle 

URL http : //www . int . Washington . edu/PROGRAMS/09-3 . html 
[14] 2011 Amplitude Analysis in Hadron Spectroscopy Trento URL http://www.ect.it/ 



® There is also a wealth of issues related to the physics, especially as to what amplitude models to employ. 
These issues at the interface between theory, experiment and computing have been discussed in depth in recent 
workshops in Seattle [13] and Trento [14]. 



